Experimental Min Nan extraction function #397

lfashby · 2021-03-27T16:05:03Z

Updated Unreleased in CHANGELOG.md to reflect the changes in code or data.

This adds an (incomplete) Min Nan extraction function in the hopes that it will nudge us toward developing solutions to #259 and #329. The extraction function currently only targets entries from the Hokkien 'dialect' (just because it seemed the most prevalent); a user can then specify 'subdialects' of Hokkien with --dialect. I've added some data with the (sub)dialect set to Xiamen.

To improve/expand the coverage of this extraction function we need to settle on a solution to #329. One solution might be to add a --subdialect option, with --dialect used for Hokkien/Teochew and --subdialect for nested dialects like Xiamen/Taipei (so for Portuguese we could set --dialect to Brazil and --subdialect to Paulista/South Brazil). This would be easy to implement but would probably confuse users. It would also require users (or people running the big scrape) to scrape the same language many separate times if they wanted data from all the dialects and subdialects of that language in separate TSVs.
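
A minimal sketch of how the --subdialect idea could behave, assuming entries have already been extracted into tuples. This is not WikiPron's actual code: `build_parser`, `select`, the `Entry` shape, and the placeholder words and pronunciations are all hypothetical and only illustrate the two-level filter.

```python
import argparse
from typing import List, Optional, Sequence, Tuple

# Hypothetical entry shape: (word, pronunciation, dialect, subdialect).
Entry = Tuple[str, str, str, Optional[str]]


def build_parser() -> argparse.ArgumentParser:
    """Builds a toy CLI with the two proposed flags (illustrative only)."""
    parser = argparse.ArgumentParser(prog="scrape-sketch")
    parser.add_argument("language")
    parser.add_argument("--dialect", help="top-level dialect, e.g. Hokkien")
    parser.add_argument("--subdialect", help="nested dialect, e.g. Xiamen")
    return parser


def select(
    entries: Sequence[Entry],
    dialect: Optional[str],
    subdialect: Optional[str],
) -> List[Tuple[str, str]]:
    """Keeps entries matching whichever of the two labels were requested."""
    kept = []
    for word, pron, entry_dialect, entry_subdialect in entries:
        if dialect is not None and entry_dialect != dialect:
            continue
        if subdialect is not None and entry_subdialect != subdialect:
            continue
        kept.append((word, pron))
    return kept


# Placeholder data; not real Wiktionary entries.
ENTRIES: List[Entry] = [
    ("word1", "pron1", "Hokkien", "Xiamen"),
    ("word2", "pron2", "Hokkien", "Taipei"),
    ("word3", "pron3", "Teochew", None),
]

args = build_parser().parse_args(
    ["nan", "--dialect", "Hokkien", "--subdialect", "Xiamen"]
)
xiamen_only = select(ENTRIES, args.dialect, args.subdialect)
```

Each (dialect, subdialect) pair still needs its own invocation here, which is the repeated-scrape drawback described above.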

An alternative solution would be to revamp our dialects system along these lines: if a user runs something like wikipron nan --dialects, we run through the language once and write as many TSVs as there are dialects/subdialects in that language (easier said than done). If a user runs wikipron nan, we run through the language once and they get one TSV containing all the entries from all the dialects/subdialects. There are different ways of doing this, but the coolest would be to 'discover' the dialects/subdialects as we scrape a language and automatically write them to different TSVs.
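
A rough sketch of the 'discover as we scrape' idea, assuming each scraped entry arrives with whatever dialect labels were found next to it. The function names, the `Entry` shape, and the file-naming scheme are hypothetical and not part of WikiPron; the point is only that one pass can feed either one TSV or one TSV per discovered dialect/subdialect.

```python
import csv
from collections import defaultdict
from pathlib import Path
from typing import Dict, Iterable, List, Tuple

# Hypothetical entry shape: (word, pronunciation, dialect labels).
Entry = Tuple[str, str, Tuple[str, ...]]
Rows = List[Tuple[str, str]]


def group_by_dialect(entries: Iterable[Entry], split: bool) -> Dict[str, Rows]:
    """Makes a single pass, creating a bucket whenever a new label shows up."""
    groups: Dict[str, Rows] = defaultdict(list)
    for word, pron, labels in entries:
        if split:
            # Entries without any label fall into an "unlabeled" bucket.
            key = "_".join(label.lower() for label in labels) or "unlabeled"
        else:
            # Without splitting, everything lands in one bucket.
            key = "all"
        groups[key].append((word, pron))
    return dict(groups)


def write_tsvs(groups: Dict[str, Rows], language: str, out_dir: Path) -> List[Path]:
    """Writes one TSV per bucket and returns the paths in sorted key order."""
    paths = []
    for key, rows in sorted(groups.items()):
        path = out_dir / f"{language}_{key}.tsv"
        with path.open("w", encoding="utf-8", newline="") as sink:
            writer = csv.writer(sink, delimiter="\t", lineterminator="\n")
            writer.writerows(rows)
        paths.append(path)
    return paths
```

With `split=True` the set of output files is determined by the data rather than by a hand-maintained list of dialects; with `split=False` the same pass yields a single combined file.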

kylebgorman

This looks good to me.

I am fine with the --subdialect proposal but agree that --dialects is a superior solution. It should also reduce our need to micromanage this all, right?

jacksonllee

LGTM. Kudos to you, Lucas, for working on these challenges! Understood that this Min Nan scrape was arbitrarily set for Hokkien for now, and is subject to change in future PRs based on how we want to handle subdialects.

The --dialects proposal does sound great (might need another flag name -- too easily confused with the existing --dialect), but I see implementation may be tricky (which Lucas has already alluded to). So far we've seen the Chinese-styled and Brazilian Portuguese-styled subdialect formatting on Wiktionary. Are there other flavors we haven't come across yet?

jacksonllee · 2021-03-27T16:36:21Z

data/scrape/lib/languages.json

+            "zyyy": "Common",
+            "latn": "Latin",
+            "hira": "Hiragana",
+            "hani": "Han"


I need a quick reminder -- where do these come from again?

From our new languages_update.py postprocessing step.

lfashby · 2021-03-27T23:17:22Z

> It should also reduce our need to micromanage this all, right?

Certainly, though perhaps by introducing something even more demanding of micromanagement!

> So far we've seen the Chinese-styled and Brazilian Portuguese-styled subdialect formatting on Wiktionary. Are there other flavors we haven't come across yet?

Not that I'm aware of, though it'd definitely be best to go on a bit of a hunt for them before trying out this dialects approach.

lfashby added 8 commits March 25, 2021 15:41

somewhat acceptable Min Nan extractor

909eff6

raw nan scrape before internet cut out

1485539

raw nan scrape

30c6d72

nan postprocessing, big scrape readme fixes

cf419ef

updates tests

dba33b2

Merge branch 'master' into min

7d669d6

cleanup test_scrape

cbc4131

updates changelog

b6eb980

lfashby requested review from kylebgorman and jacksonllee March 27, 2021 16:13

kylebgorman approved these changes Mar 27, 2021

jacksonllee approved these changes Mar 27, 2021

lfashby merged commit db093e6 into CUNY-CL:master Mar 27, 2021

lfashby deleted the min branch September 13, 2021 19:24
